feat: Add support for speculators Eagle checkpoints #20436


Draft: rahul-tuli wants to merge 23 commits into main from feat/speculators-eagle-support

Conversation

@rahul-tuli (Contributor) commented Jul 3, 2025

Summary

This PR adds support for loading Eagle models converted using the speculators library's checkpoint converter. This enables vLLM to use speculators-format Eagle models for speculative decoding, bridging the gap between the speculators ecosystem and vLLM.

Technical Details

Problem Statement

The speculators library provides a unified framework for speculative decoding models, but uses a different configuration and weight naming convention than vLLM's native Eagle implementation. Key differences include:

  1. Configuration Format:

    • Speculators: {"speculators_model_type": "eagle", "transformer_layer_config": {...}, "fusion_bias": bool, "layernorms": bool}
    • vLLM: {"model_type": "eagle", "model": {...}, "eagle_fc_bias": bool, "model.add_para_norm": bool}
  2. Weight Naming:

    • Speculators uses descriptive names: fusion_fc.weight, embedding_layernorm.weight
    • vLLM uses compact names: fc.weight, enorm.weight
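
For concreteness, the sketch below lays the two shapes side by side. The field values are illustrative only (not taken from a real checkpoint), and the dotted model.add_para_norm key mirrors the notation used above for the nested field.

# Illustrative only: a speculators-format Eagle config ...
speculators_format = {
    "speculators_model_type": "eagle",
    "transformer_layer_config": {"hidden_size": 4096, "num_hidden_layers": 1},
    "fusion_bias": False,
    "layernorms": True,  # True for the HASS variant
}

# ... and the corresponding fields in vLLM's native Eagle format.
vllm_format = {
    "model_type": "eagle",
    "model": speculators_format["transformer_layer_config"],
    "eagle_fc_bias": speculators_format["fusion_bias"],
    "model.add_para_norm": speculators_format["layernorms"],  # dotted key = nested field
}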

Solution Architecture

This implementation provides a translation layer that:

  1. Config Adapter (SpeculatorsEagleConfig):

    • Detects speculators format via speculators_model_type field
    • Translates configuration fields during model loading
    • Preserves all Eagle functionality (fusion bias, layernorms/HASS support)
  2. Weight Remapping:

    • Implemented in EAGLE.load_weights() for transparent operation
    • Maps speculators weight names to vLLM expected names
    • Handles transformer layer prefix differences
  3. Automatic Detection:

    • Modified get_config() to detect and route speculators configs
    • Maintains backward compatibility with existing Eagle models
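
To make the detection step concrete, here is a minimal sketch of what is_speculators_eagle_config could look like. The function name appears later in this PR, but the body here is an assumption based on the description above and on a later commit that reads the config via PretrainedConfig.get_config_dict(); eagle3 is included because a follow-up commit extends detection to it.

from transformers import PretrainedConfig


def is_speculators_eagle_config(model: str) -> bool:
    """Heuristic: does this checkpoint use the speculators Eagle format?

    get_config_dict() accepts both local paths and Hugging Face Hub IDs,
    so the same check covers either source.
    """
    config_dict, _ = PretrainedConfig.get_config_dict(model)
    return config_dict.get("speculators_model_type") in ("eagle", "eagle3")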

Implementation Details

Files Modified

  1. vllm/transformers_utils/configs/speculators_eagle.py (new):

    class SpeculatorsEagleConfig(EAGLEConfig):
        @classmethod
        def from_pretrained(cls, path, **kwargs):
            # Load the speculators config and convert it to vLLM format
            ...

        @classmethod
        def _convert_speculators_to_vllm(cls, config):
            # Field mappings:
            # - fusion_bias → eagle_fc_bias
            # - layernorms → model.add_para_norm
            # - transformer_layer_config → model
            ...
  2. vllm/model_executor/models/eagle.py:

    def load_weights(self, weights):
        speculators_name_map = {
            "fusion_fc.weight": "fc.weight",
            "fusion_fc.bias": "fc.bias",
            "embedding_layernorm.weight": "enorm.weight",
            "pre_lm_head_layernorm.weight": "hnorm.weight",
        }
        # Apply mappings during weight loading
  3. vllm/transformers_utils/config.py:

    • Added speculators Eagle detection logic
    • Routes to SpeculatorsEagleConfig when detected
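
The weight remapping itself reduces to a dictionary lookup plus a prefix rewrite. The sketch below is an approximation assembled from the mapping table above and the commit notes on this branch (which introduce a module-level SPECULATORS_WEIGHT_MAP and a remap_speculators_weight_name helper); the exact body is an assumption, and the real helper may additionally return None for weights that should be skipped.

from typing import Optional

# Mirrors the SPECULATORS_WEIGHT_MAP introduced in this PR
SPECULATORS_WEIGHT_MAP = {
    "fusion_fc.weight": "fc.weight",
    "fusion_fc.bias": "fc.bias",
    "embedding_layernorm.weight": "enorm.weight",
    "pre_lm_head_layernorm.weight": "hnorm.weight",
}


def remap_speculators_weight_name(name: str) -> Optional[str]:
    """Translate a speculators checkpoint weight name to vLLM's expected name."""
    if name in SPECULATORS_WEIGHT_MAP:
        return SPECULATORS_WEIGHT_MAP[name]
    # Speculators stores the single decoder layer under "transformer.";
    # vLLM's Eagle model expects it under "model.layers.0."
    if name.startswith("transformer."):
        return "model.layers.0." + name[len("transformer."):]
    return name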

Testing

Models Tested

1. Standard Eagle (without layernorms)

  • Model: nm-testing/eagle-llama3.1-8b-instruct
  • Size: 481MB
  • Config: {"layernorms": false, "fusion_bias": false}
  • Status: ✅ Loads and generates correctly

2. HASS Variant (with layernorms)

  • Model: nm-testing/hass-llama3.1-8b-layernorms
  • Size: 961MB
  • Config: {"layernorms": true, "fusion_bias": false}
  • Additional weights: embedding_layernorm, pre_lm_head_layernorm
  • Status: ✅ Loads and generates correctly

Test Results

Both models successfully:

  1. Load configurations correctly with proper field mapping
  2. Map all weights properly (including layernorm weights for HASS)
  3. Generate coherent text
  4. Maintain performance characteristics

Verification Script

#!/usr/bin/env python3
"""Test all speculators Eagle models with vLLM."""

from vllm import LLM, SamplingParams

# Eagle models to test
models = {
    "Standard Eagle (local)": "/home/rahul/speculators/eagle-standard-converted",
    "Standard Eagle (HuggingFace)": "nm-testing/eagle-llama3.1-8b-instruct",
    "HASS Eagle (local)": "/home/rahul/speculators/hass-layernorms-converted",
    "HASS Eagle (HuggingFace)": "nm-testing/hass-llama3.1-8b-layernorms",
}

# Test each model
results = []
prompt = "The benefits of open source software include"

for name, eagle_model in models.items():
    try:
        # Load models
        llm = LLM(
            model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # Base model
            speculative_config={
                "model": eagle_model,  # Eagle speculator
                "num_speculative_tokens": 5
            },
            trust_remote_code=True,
            gpu_memory_utilization=0.4,
            enforce_eager=True,
            max_model_len=1024,
        )
        
        # Generate text
        output = llm.generate([prompt], SamplingParams(temperature=0, max_tokens=30))
        generated_text = output[0].outputs[0].text.strip()
        
        results.append((name, "PASSED", generated_text))
        del llm
        
    except Exception as error:
        results.append((name, "FAILED", f"Error: {error}"))

# Show results (appears after vLLM logs)
print("\n" + "="*80)
print("TEST RESULTS")
print("="*80)

for name, status, output in results:
    print(f"\n{name}: {status}")
    print(f"Output: {output}")

print("\n" + "="*80)
passed = sum(1 for _, status, _ in results if status == "PASSED")
print(f"Summary: {passed} out of {len(results)} tests passed")

# Exit with a non-zero status if any test failed
raise SystemExit(0 if passed == len(results) else 1)

Verification Output

================================================================================
TEST RESULTS
================================================================================

Standard Eagle (local): PASSED
Output: :
1. Cost savings: Open source software is often free or low-cost, which can be a significant advantage for individuals and organizations with limited budgets.

Standard Eagle (HuggingFace): PASSED
Output: :
1. Cost savings: Open source software is often free or low-cost, which can be a significant advantage for individuals and organizations with limited budgets.

HASS Eagle (local): PASSED
Output: :
1. Cost savings: Open source software is often free or low-cost, which can be a significant advantage for individuals and organizations with limited budgets.

HASS Eagle (HuggingFace): PASSED
Output: :
1. Cost savings: Open source software is often free or low-cost, which can be a significant advantage for individuals and organizations with limited budgets.

================================================================================
Summary: 4 out of 4 tests passed

Future Work

Phase 1: Direct Model Serving (Future PR)

Enable vllm serve <speculators_model> by automatically configuring speculative decoding:

# Future capability for both model types
vllm serve nm-testing/eagle-llama3.1-8b-instruct  # Standard Eagle
vllm serve nm-testing/hass-llama3.1-8b-layernorms  # HASS variant

# Would automatically:
# 1. Detect speculators format
# 2. Extract verifier model from config
# 3. Set up speculative decoding
# 4. Configure layernorms if present

Implementation approach:

  1. Create SpeculatorsModelConfig that reads verifier info from speculators config
  2. Modify EngineArgs to auto-configure when detecting speculators models

Phase 2: Extended Speculators Support

  • Support for other speculators model types
  • Unified config detection and routing system

Phase 3: Ecosystem Integration

  • Direct speculators library integration for on-the-fly conversion

Design Decisions

  1. Translation Layer vs. Native Support: Chose translation to maintain compatibility with existing vLLM Eagle implementation
  2. Weight Remapping Location: Implemented in load_weights() for transparency and minimal code changes
  3. Config Detection: Used speculators_model_type field as definitive indicator
  4. Backward Compatibility: All changes are additive; existing Eagle models continue to work
  5. Layernorm Support: Automatically detected via layernorms field, mapped to vLLM's add_para_norm

Performance Impact

  • No runtime overhead after model loading
  • Config translation happens once during initialization
  • Weight remapping is a simple dictionary lookup during loading

Dependencies

No new dependencies required. Uses existing vLLM and transformers infrastructure.

Checklist

  • Code follows vLLM style guidelines
  • Added config translation with proper field mappings
  • Implemented weight name remapping
  • Tested with multiple model variants (standard Eagle and HASS)
  • Maintains backward compatibility
  • Add unit tests for config conversion
  • Add integration tests
  • Update documentation

This is a draft PR for initial feedback.

github-actions bot commented Jul 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot left a comment


Summary of Changes

Hello @rahul-tuli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces compatibility for Eagle models that have been converted using the speculators library. It establishes a robust translation layer within vLLM to seamlessly handle differences in configuration and weight naming conventions between the speculators ecosystem and vLLM's native Eagle implementation, enabling speculative decoding with these models.

Highlights

  • Support for Speculators Eagle Models: This pull request enables vLLM to load and utilize Eagle models that have been converted using the speculators library. This bridges the compatibility gap between the speculators ecosystem and vLLM's native Eagle implementation, allowing for speculative decoding with these models.
  • Configuration Translation Layer: A new SpeculatorsEagleConfig class has been introduced. This class automatically detects speculators-specific configuration formats (e.g., speculators_model_type, transformer_layer_config, fusion_bias, layernorms) and translates them into vLLM's expected Eagle configuration format during model loading.
  • Automated Weight Remapping: Logic has been added to the EAGLE.load_weights() method to transparently remap weight names. This handles the differences between speculators' naming conventions (e.g., fusion_fc.weight, transformer.) and vLLM's internal naming scheme, ensuring that weights are loaded correctly without manual intervention.
  • Seamless Integration and Backward Compatibility: The get_config() utility has been updated to automatically detect speculators Eagle models based on their configuration. These models are then routed through the new translation layer. This integration is designed to be additive, ensuring full backward compatibility with existing vLLM Eagle models.

@gemini-code-assist bot left a comment


Code Review

This pull request adds support for loading Eagle models from the speculators library by introducing a configuration adapter and weight remapping. The overall approach is sound. The review focuses on the new SpeculatorsEagleConfig implementation and identifies critical issues related to handling remote models from the Hugging Face Hub, which would prevent the feature from working in a common use case. Detailed suggestions are provided to fix these issues by properly fetching remote configurations, along with a minor suggestion to improve code maintainability.

@dsikka (Contributor) left a comment


Can you post steps to run?

I wasn't able to run the verification script on this branch with speculators main.

@@ -334,6 +336,17 @@ def get_config(
raise ValueError(error_message) from e

if config_format == ConfigFormat.HF:
# Check if this is a speculators Eagle model
if is_speculators_eagle_config(model):
config = SpeculatorsEagleConfig.from_pretrained(
Contributor

Are all existing supported models just going through the PretrainedConfig pathway?

Contributor Author

Yes!

@rahul-tuli (Contributor Author) commented Jul 4, 2025

Can you post steps to run?

I wasn't able to run the verification script on this branch with speculators main.

We don't need speculators to run the models; here are the steps:

Step 1: Convert an existing Eagle or HASS checkpoint with the speculators convert utility from this branch (yet to land on main): neuralmagic/speculators#39

There is a doc explaining how to use the convert utility here: https://github.com/neuralmagic/speculators/blob/efab1758d803e03f42c85cc67425cefa80c5344f/docs/convert.md

For example, convert an existing Eagle checkpoint using:

speculators convert --eagle yuhuili/EAGLE-LLaMA3.1-Instruct-8B ./converted/eagle meta-llama/Llama-3.1-8B-Instruct

Output:

2025-07-04 17:14:28.581 | INFO     | speculators.convert.eagle.eagle_converter:convert:50 - Converting Eagle checkpoint: yuhuili/EAGLE-LLaMA3.1-Instruct-8B
2025-07-04 17:14:28.581 | INFO     | speculators.convert.eagle.eagle_converter:_ensure_local:93 - Downloading checkpoint from HuggingFace: yuhuili/EAGLE-LLaMA3.1-Instruct-8B
Fetching 2 files: 100%|█████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 31775.03it/s]
2025-07-04 17:14:28.671 | DEBUG    | speculators.convert.eagle.eagle_converter:_ensure_local:100 - Downloaded to: /home/rahul/.cache/huggingface/hub/models--yuhuili--EAGLE-LLaMA3.1-Instruct-8B/snapshots/89073acba22a03994aee0c76774a10ca941e4706
2025-07-04 17:14:28.672 | DEBUG    | speculators.convert.eagle.eagle_converter:_load_checkpoint:118 - Loading config from: /home/rahul/.cache/huggingface/hub/models--yuhuili--EAGLE-LLaMA3.1-Instruct-8B/snapshots/89073acba22a03994aee0c76774a10ca941e4706/config.json
2025-07-04 17:14:28.672 | DEBUG    | speculators.convert.eagle.eagle_converter:_load_checkpoint:133 - Loading PyTorch weights from: /home/rahul/.cache/huggingface/hub/models--yuhuili--EAGLE-LLaMA3.1-Instruct-8B/snapshots/89073acba22a03994aee0c76774a10ca941e4706/pytorch_model.bin
2025-07-04 17:14:29.640 | INFO     | speculators.convert.eagle.eagle_converter:convert:55 - Loaded 10 weights
2025-07-04 17:14:29.640 | DEBUG    | speculators.convert.eagle.eagle_converter:_build_config:169 - Building EagleSpeculatorConfig
2025-07-04 17:14:29.641 | DEBUG    | speculators.convert.eagle.eagle_converter:_build_config:219 - Config built with fusion_bias=False, layernorms=False
2025-07-04 17:14:29.644 | DEBUG    | speculators.convert.eagle.eagle_converter:_process_weights:242 - Processing 10 weights
2025-07-04 17:14:29.644 | DEBUG    | speculators.convert.eagle.eagle_converter:_process_single_weight:277 - Skipping embed_tokens.weight (tied to lm_head)
2025-07-04 17:14:29.644 | DEBUG    | speculators.convert.eagle.eagle_converter:_process_weights:259 - Skipped weights: ['embed_tokens.weight']
2025-07-04 17:14:29.644 | DEBUG    | speculators.convert.eagle.eagle_converter:_process_weights:261 - Remapped weights: ['layers.0.self_attn.q_proj.weight -> transformer.self_attn.q_proj.weight', 'layers.0.self_attn.k_proj.weight -> transformer.self_attn.k_proj.weight', 'layers.0.self_attn.v_proj.weight -> transformer.self_attn.v_proj.weight', 'layers.0.self_attn.o_proj.weight -> transformer.self_attn.o_proj.weight', 'layers.0.mlp.gate_proj.weight -> transformer.mlp.gate_proj.weight', 'layers.0.mlp.up_proj.weight -> transformer.mlp.up_proj.weight', 'layers.0.mlp.down_proj.weight -> transformer.mlp.down_proj.weight', 'layers.0.post_attention_layernorm.weight -> transformer.post_attention_layernorm.weight', 'fc.weight -> fusion_fc.weight']
2025-07-04 17:14:29.700 | DEBUG    | speculators.convert.eagle.eagle_converter:_save_checkpoint:319 - Saving config to: converted/eagle/config.json
2025-07-04 17:14:29.701 | DEBUG    | speculators.convert.eagle.eagle_converter:_save_checkpoint:325 - Saving weights to: converted/eagle/model.safetensors
2025-07-04 17:14:30.157 | SUCCESS  | speculators.convert.eagle.eagle_converter:convert:72 - Saved to: converted/eagle

The converted checkpoint will be saved in converted/eagle (after this point speculators is not needed)

Step 2: Check out the current branch in vLLM and run the model:

from vllm import LLM, SamplingParams

# UPDATE THIS PATH
eagle_model_path = "/home/rahul/speculators/converted/eagle"

print("Loading models...")

# Create LLM with Eagle speculative decoding
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # target/verifier model
    speculative_config={
        "model": eagle_model_path,  # Your Eagle model path
        "num_speculative_tokens": 5,  # Number of tokens to predict ahead
    },
    trust_remote_code=True,
    gpu_memory_utilization=0.4,
    max_model_len=1024, 
)

print("Models loaded! Generating text...")

# Your prompt
prompt = "The benefits of open source software include"


sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=100,
)

# Generate text
output = llm.generate([prompt], sampling_params)[0]
generated_text = output.outputs[0].text

print(f"\nPrompt: {prompt}")
print(f"Generated: {generated_text}")

Output:

...(truncated for brevity)...
Models loaded! Generating text...
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 74.57it/s]
Processed prompts: 100%|██████████████████████████| 1/1 [00:02<00:00,  2.42s/it, est. speed input: 3.31 toks/s, output: 41.41 toks/s]

Prompt: The benefits of open source software include
Generated: :
1. Cost savings: Open source software is often free or low-cost, which can be a significant advantage for individuals and organizations with limited budgets.
2. Customization: Open source software can be modified and customized to meet specific needs, which can be particularly useful for businesses or organizations with unique requirements.
3. Community support: Open source software often has a large and active community of developers and users who contribute to its development and provide support.
4. Security: Open source software can be more secure
[rank0]:[W704 17:23:14.420996530 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

@dsikka (Contributor) commented Jul 5, 2025

(Quoting rahul-tuli's conversion steps from the previous comment.)

The CLI does not seem to be working on that branch.

@rahul-tuli (Contributor Author)

(Quoting the preceding exchange.)

Could you paste the command you used, and the traceback, on that PR?

@dsikka (Contributor) commented Jul 7, 2025

(Quoting the preceding exchange.)

I'll touch base offline. It just didn't recognize the speculators command.

transformer_config["architectures"] = [arch]

# Build vLLM config
vllm_config = {
@dsikka (Contributor) commented Jul 7, 2025

Why don't we need to add the verifier model as part of the config? How are the two differentiated in the vllm_config object?

@dsikka (Contributor) left a comment

question

mergify bot commented Jul 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rahul-tuli.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 8, 2025
@rahul-tuli force-pushed the feat/speculators-eagle-support branch from e9fecc1 to 8e1183c on July 9, 2025 13:38
@mergify mergify bot removed the needs-rebase label Jul 9, 2025
@rahul-tuli force-pushed the feat/speculators-eagle-support branch 4 times, most recently from 85a11c7 to 81c9904 on July 9, 2025 14:30
@mergify mergify bot added llama Related to Llama models documentation Improvements or additions to documentation labels Jul 9, 2025
@rahul-tuli force-pushed the feat/speculators-eagle-support branch 3 times, most recently from 9f730f4 to 875b786 on July 15, 2025 13:22
@aarnphm aarnphm self-assigned this Jul 15, 2025
@@ -22,6 +23,23 @@

logger = init_logger(__name__)

# Weight name mapping for speculators format compatibility
SPECULATORS_WEIGHT_MAP = {
"fusion_fc.weight": "fc.weight",
Contributor

I thought we removed this translation in speculators?

DEFAULT_NUM_LOOKAHEAD_TOKENS = 5


class SpeculatorsEagleConfig(EAGLEConfig):
@dsikka (Contributor) commented Jul 15, 2025

Why did we change this from SpeculatorsConfig?

@mergify mergify bot added the performance Performance-related issues label Jul 15, 2025
rahul-tuli and others added 3 commits July 15, 2025 14:11
- Add SpeculatorsEagleConfig to handle speculators config format
- Update config loader to detect speculators Eagle models
- Add weight name remapping in Eagle model load_weights
- Support both standard Eagle and HASS (with layernorms) variants

This enables vLLM to load Eagle models converted using the speculators
library's checkpoint converter, mapping config fields and weight names
to vLLM's expected format.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Remove unused Any, Dict, Optional imports
- Remove unused AutoConfig import
- Keep only Union which is actually used in type annotations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Use PretrainedConfig.get_config_dict() to handle both local and HF paths
- Simplifies the code and follows best practices
- Tested with both local paths and HuggingFace model IDs

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
rahul-tuli and others added 15 commits July 15, 2025 14:11
- Set method='eagle' in vllm_config to ensure proper model detection
- This field is required by EAGLEConfig parent class
- Helps with future V1 engine compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Changed method field in vllm_config to use speculators_config.get("speculators_model_type", "eagle")
- This allows the method to be dynamically set based on the speculators model type
- Maintains backward compatibility with default value of "eagle"

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Added check for model_type == "eagle" in SpeculativeConfig auto-detection
- This ensures speculators Eagle models are properly detected and method is set to "eagle"
- Fixes V1 engine compatibility check for speculators Eagle models

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Import is_speculators_eagle_config function
- Add simple check for speculators Eagle models when method is not set
- Minimal change that handles speculators format as a special case
- Fixes issue where speculative_method was None causing V0 fallback

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Added speculators_name_map to handle fusion_fc -> fc weight remapping
- Also handles transformer.* -> model.layers.0.* prefix remapping
- Fixes KeyError for fusion_fc.weight when loading speculators Eagle models
- Similar to the remapping already added to eagle.py model

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Updated llama_eagle.py to skip transformer weights (loaded separately)
- Added num_lookahead_tokens to speculators config (required for Eagle)
- Together these fixes allow speculators Eagle models to work with V1 engine

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Explains all changes needed for speculators Eagle models
- Details the rationale behind each modification
- Includes common questions and answers
- Provides testing examples
- Documents config translation, weight remapping, and V1 detection

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Updated speculators config detection to check for speculators_model_type key
- Support both eagle and eagle3 in is_speculators_eagle_config
- Handle Eagle-3 specific config fields (draft_vocab_size, target_hidden_size)
- Infer target_hidden_size from transformer config if not provided
- Skip non-existent weights in llama_eagle to handle HASS models gracefully
- Eagle-3 models don't need weight translation (already use correct names)

This enables support for:
- nm-testing/eagle3-llama3.1-8b-instruct-speculators
- nm-testing/EAGLE3-LLaMA3.3-Instruct-70B-speculators

While maintaining backward compatibility with Eagle-1 models.

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Add RMSNorm import and support for enorm/hnorm in llama_eagle.py
- Apply layernorms in forward pass when add_para_norm is enabled
- Handle speculators weight remapping in EagleLlamaForCausalLM.load_weights
- Fixes HASS Eagle models (nm-testing/hass-llama3.1-8b-layernorms) in V1 engine

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Remove redundant model_type field from vllm_config (already defined in EAGLEConfig)
- Extract num_lookahead_tokens from proposal_methods in speculators config
- Add proper assertions for required speculators config structure
- Remove unnecessary intermediate variable speculators_cfg

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
This documentation is no longer needed as the implementation is complete.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Remove V0 engine changes from eagle.py
- Keep V1 engine support in llama_eagle.py with layernorm support
- Maintain config detection and translation for speculators format
- Ensure V1 engine compatibility for all Eagle models

This simplifies the implementation by focusing only on the modern V1 engine
which provides better performance and features.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move SPECULATORS_WEIGHT_MAP to module level to eliminate duplication
- Replace duplicate _remap_weight_name methods with single function
- Fix line continuation style to use proper parentheses
- Streamline weight loading logic while preserving functionality
- Remove verbose comments while keeping essential documentation
- Preserve original 'fc' naming convention

This consolidation improves maintainability and follows vLLM
code style conventions while preserving all existing functionality
for both Eagle-1 and Eagle-3 speculators models.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: rtuli@redhat.com
Co-Authored-By: Claude <noreply@anthropic.com>
- Add weight name mapping for speculators format compatibility
- Support HASS variant with additional layernorms
- Handle both Eagle-1 and Eagle-3 configurations
- Maintain backward compatibility with existing Eagle models

This enables using Eagle draft models packaged with the speculators
library directly in vLLM for speculative decoding.
@rahul-tuli force-pushed the feat/speculators-eagle-support branch from 36b8502 to 00da923 on July 15, 2025 18:14
@@ -159,6 +204,11 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]):

model_weights = {}
for name, loaded_weight in weights:
remapped_name = remap_speculators_weight_name(name)
Contributor

Should this be guarded by a check that a speculators config is present in self.config?

@@ -2767,9 +2767,15 @@ def __post_init__(self):
# Automatically detect the method
if self.method in ('eagle', 'eagle3'):
pass
elif hasattr(self.draft_model_config.hf_config,
"speculators_model_type") and \
self.draft_model_config.hf_config.speculators_model_type in ("eagle", "eagle3"):
Contributor

Why do we need this?

elif "eagle-" in self.draft_model_config.model.lower() or \
"eagle3-" in self.draft_model_config.model.lower():
self.method = "eagle"
elif self.draft_model_config.hf_config.model_type == "eagle":
Contributor

same as above

@@ -22,6 +24,27 @@

logger = init_logger(__name__)

# Map speculators weight names to vLLM names
Contributor

Should probably live in the speculators config file

}


def remap_speculators_weight_name(name: str) -> Optional[str]:
Contributor

same as above

return num_lookahead_tokens

@classmethod
def _apply_eagle_v1_config(
Contributor

These are fine for now, but I'm wondering if we can combine the eagle1/eagle3 _apply methods defined here.

Labels: documentation (Improvements or additions to documentation), llama (Related to Llama models), performance (Performance-related issues), speculative-decoding